-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#116 Allow token processing "middleware" #144
base: master
Are you sure you want to change the base?
Conversation
lib/index.js
Outdated
if (!labels.hasOwnProperty(obj)) continue; | ||
|
||
// Check for negation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I originally forked the repo, here we used to check if the previous word was a negation and invert the score. Now seems like we deal with negation further down. So we end up negating the score twice. Might be that my change to deal with negative words is redundant now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @nsantini, great work. Just a couple of points:
-
The spell-checking feature looks good to me. However, we have to consider if we should add an option to enable / disable this feature (are there any use-cases where disabling spell-checking is required? Performance perhaps?). @thisandagain thoughts?
-
Regarding negation: since PR Add support for additional languages #128 for supporting additional languages, this logic has now moved to
/languages/en/scoring-strategy.js
, or more generally/languages/[language]/scoring-strategy.js
. The scoring strategy determines what score to assign to a given token, and allows processing negation, emphasis, and any other language specifics.
You are currently performing the negation twice, that is what's making your tests fail. -
It would be great to have a few validation tests for the backward lookup feature, there doesn't seem to be any.
Ok
|
Finally, made spell checking optional, and unit tested it |
Thanks for the quick update @nsantini, great work!
We're almost there! |
The current implementation uses https://www.npmjs.com/package/spellchecker , which are native bindings to NSSpellChecker, Hunspell, or the Windows 8 Spell Check API, depending on your platform. Windows 7 and below as well as Linux will rely on Hunspell. So it depends on what language you have in your system. Also it depends on those subsystems to support multiple languages. |
Does this mean that, for example, if my OS is set to use the spanish locale, then the spellchecker module would always try to spellcheck with spanish as the target language, regardless of the actual language of the input? If this is the case, I feel like this would lead to a very inconsistent behaviour. Is there a way to specify to the spellchecker which language to use? |
There is a mechanism to add a dictionary to the spellchecker, I believe thats how a language gets "set" |
I have refactored the spell checking functionality to use a different library nspell that can support multiple languages. |
This is all great work, we're almost there!
For example: // languages/fr/index.js
module.exports = {
labels: require('./labels.json'),
dictionary: require('dictionary-fr'), // specify the language dictionary here for spellchecking
scoringStrategy: require('./scoring-strategy')
};
After:
Ideally there should be barely any impact when the spellchecking feature is disabled (which it should be, by default). Could you please investigate this issue?
Thank you for your patience @nsantini and for bearing with me! |
In addition to the performance implications of this change necessitating the feature be "off" by default, I think the validation tests also strongly suggest that this should be optional: Before
After
Nice improvement (~1%) on the Amazon validation set, but the IMDB accuracy dropped ~1.1% and Yelp stayed stable. This may suggest that the change is indeed helping with less formal speech, but is actually causing false positives in a more formal corpus (or vs versa), but more investigation would be required. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good @nsantini! This is great to see. Agreed with @elyas-bhy's suggestions above. I also would love to further discuss the edit distance strategy and validation test differences as described in the comments.
languages/en/distance.js
Outdated
|
||
/** | ||
* Finds the closest match between a statement and a body of words using | ||
* Levenshtein Distance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Levenshtein Distance is a great performant strategy, but if we are going to make this optional anyway we might want to discuss / test alternative edit distance algorithms (e.g. Myers Diff Algorithm).
PDF:
myers.pdf
Ok, moved the spell checking to the library and left loading the dictionary to the language. Added extra unit test, only two lines on the spell checking are not covered, but they are there for safekeeping of an edge case that shouldnt happen. sentiment (Latest) - Short x 493,203 ops/sec ±1.14% (90 runs sampled) Amazon accuracy: 0.7302697302697303 The changes to use spell checking, off by default, should not have affected accuracy, since I haven't changed how is tested. |
languages/en/dictionary.js
Outdated
if (dictiaonary === null) { | ||
var base = require.resolve('dictionary-en-us'); | ||
dictiaonary = { | ||
'aff': read(base.replace('.js', '.aff'), 'utf-8'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain why do we need to perform such gymnastics when loading the dictionary, instead of simply requiring it and passing it over to nspell
, as described in their usage example?
Also, there is a typo in the spelling of the "dictionary" variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the typo.
Regarding "gymnastics": the dictionary can be loaded on an async fashion. But the whole sentiment library works synchronously. So I decided to not pollute the whole library with callbacks or promises, and given that I couldn't use async/await to do it (eslint complained about it) I decided to load it this way.
README.md
Outdated
getDictionary: { | ||
apply: function() { | ||
// Load a dictionary for the language for nspell to use | ||
return { aff, dic }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these aff
and dic
properties correspond to? Could you give an example?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explained more in the README file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to be the case, are you sure that you have pushed your changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new commit, it references the explanation I gave in the "Spell checked example" section
Hi, latest changes:
|
LGTM. @thisandagain? |
@thisandagain any more comments on this? Regarding your comment about looking for alternative string distance algorithms, I would suggest leave this PR as it is (since the current implementation is performant and correct), and open a new issue to look for alternatives |
@thisandagain any calls regarding this PR? |
@elyas-bhy @thisandagain any updates? |
Everything looks good on my side. Still waiting for @thisandagain to approve the changes. |
hi @thisandagain , just checking if you got a change to review the changes |
Ping @thisandagain. |
Should I close this PR? |
I'm really looking forward to get this merged, but need @thisandagain to approve and merge this PR. |
Ping @thisandagain |
I added node-spellchecker to check for typos, and also "levenshtein" to find the closest spell correction to the original word.
I also modified the "negation" feature to look backwards until a negation word or a new afinn word is found, to cover cases like "not too bad".